An Introductory Course on Data Preprocessing and Visualization with R

Author

waterfirst

1 Introduction

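This course is for people with no coding experience. Its goal is to help you slice up and prepare numeric data the way a cook preps ingredients.

Researchers who repeatedly process structured data and draw graphs ultimately want to present their results in a way that is easy to understand.

To that end, you will see how easily data can be handled with just the tidyverse package, and what R's strengths are.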

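A brief introduction to the R language

Created by two statisticians at the University of Auckland, New Zealand: Robert Gentleman and Ross Ihaka

Developed into a popular data tool through Hadley Wickham's packages (notably ggplot2 and the tidyverse)

Language characteristics

Indexing starts at 1 (most other languages start at 0)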

Installing and loading packages

  • install.packages("package_name")

  • library(package_name)

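How the course is organized

Data preprocessing and visualization, which act as the back end, are done with the tidyverse package, with additional packages used where needed.

So that you can apply it right away, we first learn with basic example datasets, and then each of you works with the data you use most often. The aim is to replace repetitive tasks with code and free up time for more creative work. The course runs for four weeks (one session of 2 to 3 hours per week).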


2 Course outline

  1. Installing R and basic syntax (week 1)

    https://dplyr.tidyverse.org/articles/dplyr.html

  2. Data preprocessing exercises (week 2)

    https://m-clark.github.io/data-processing-and-visualization/intro.html

  3. Data preprocessing and visualization (week 3)

    https://r-graph-gallery.com/

  4. Practice with a variety of visualizations (week 4)

    2D and 3D plots


3 Before the course: installing the software

(Do steps 1 to 3 now; steps 4 to 7 can wait until later.)

  1. Install R : https://posit.co/download/rstudio-desktop/

  2. Install RStudio : https://posit.co/download/rstudio-desktop/

  3. Install the Quarto CLI : https://quarto.org/docs/download/

  4. Install LaTeX : (in the RStudio Terminal pane) $ quarto install tinytex

  5. Sign up for a publishing site : https://quartopub.com/

  6. Sign up for GitHub : https://github.com/

  7. Install Git : https://git-scm.com/download/win

[Quarto] https://quarto.org/docs/presentations/revealjs/

When you learn a new program, getting it downloaded, installed, and configured is already 50% of the learning. ^^

RStudio overview


4 Day1

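  • Is R the best choice for data analysis and visualization?

e.g., commercial software : Excel, Minitab, Origin, MATLAB, Spotfire

Open source : Python, R

  • Why do we still need data analysis and visualization in the age of GPT?

  • Is the data in my field structured (numbers) or unstructured (text)?

  • What is the ultimate goal of data analysis?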


4-1. R Basic

1. Data types

Numeric : num (numeric), int (integer), dbl (double)
Character : chr
Factor (categorical) : fct
Logical : logi
Missing value (Not Available) : NA
Infinity (Infinite) : Inf
Check the data type : class(x), is.numeric(x), is.character(x), is.factor(x)
Convert the data type : as.numeric(x), as.factor(x), as.character(x), as.logical(x)
Note

Factors (categorical variables) : useful when drawing graphs or running statistical analyses

Data frames and tibbles, which collect data column by column, are what is used in actual analysis

There are also list, matrix, and array structures

a <- c(1,2,3,4) : numeric vector
a <- c("1", "2", "a", "b") : character vector
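
The type checks and conversions listed above can be tried on two small vectors. This is a minimal sketch in base R; the vectors and the variable names b_num and f are made up for illustration.

```r
# Hypothetical example vectors (not part of the course data)
a <- c(1, 2, 3, 4)
b <- c("1", "2", "a", "b")

class(a)        # "numeric"
class(b)        # "character"
is.numeric(a)   # TRUE
is.numeric(b)   # FALSE

# Converting text to numbers: entries that are not numbers become NA
# (R also raises a warning, suppressed here to keep the output clean)
b_num <- suppressWarnings(as.numeric(b))
b_num           # 1 2 NA NA

# A factor stores its distinct values as levels (sorted alphabetically)
f <- as.factor(c("low", "high", "low"))
levels(f)       # "high" "low"
```

Checking class() right after loading data is a quick way to catch numbers that were read in as text.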

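Keyboard shortcuts

<- : Alt + -

Run : Ctrl + Enter

|> : Ctrl + Shift + M (inserts %>% unless the native pipe option is enabled in the RStudio settings)

Comment / uncomment : Ctrl + Shift + C

Clear the console : Ctrl + L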


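2. Frequently used functions

Mean (mean) : mean(x)
Median (median) : median(x)
Maximum (max) : max(x)
Minimum (min) : min(x)
Sum (sum) : sum(x)
Standard deviation (sd) : sd(x)
Variance (var) : var(x)
Absolute value (abs) : abs(x)
Rounding (round) : round(x, number of decimal places)
Square root (sqrt) : sqrt(x)
Number of elements (length) : length(x); number of characters in a string (nchar) : nchar(x)
Number of rows and columns (dim) : dim(df)
Print (print) : print(x) / print("text")
Condition (ifelse) : ifelse(x > 10, "a", "b")
Distinct values (unique) : unique(x)
Find a text pattern (grep, grepl) : grep("pattern", x) returns the indices of the matching elements, grepl("pattern", x) returns TRUE/FALSE for each element
Find and replace a text pattern (gsub) : gsub("old text", "new text", x)
Number of columns (ncol) : ncol(df)
Number of rows (nrow) : nrow(df)
Column names (colnames) : colnames(df)
Row names (rownames) : rownames(df)
Frequency counts (table) : table(x)
Sorting (sort) : ascending sort(x), descending sort(x, decreasing = TRUE)
Names (names) : names(x); for a data frame, names(df) returns the column names
Position of the maximum and minimum (which.max, which.min) : which.max(x), which.min(x)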
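
A few of the functions above, applied to made-up data, show what each one returns. This is a minimal sketch in base R; the vectors x and words are hypothetical.

```r
x <- c(12, 7, 7, 30, 15)          # hypothetical sample values

mean(x)                           # 14.2
median(x)                         # 12
sort(x)                           # ascending: 7 7 12 15 30
sort(x, decreasing = TRUE)        # descending: 30 15 12 7 7
which.max(x)                      # 4 (position of the largest value)
unique(x)                         # 12 7 30 15
table(x)                          # how many times each value occurs
ifelse(x > 10, "a", "b")          # "a" "b" "b" "a" "a"

words <- c("apple", "banana", "cherry")
length(words)                     # 3 (number of elements)
nchar(words)                      # 5 6 6 (characters in each string)
grep("an", words)                 # 2 (index of the matching element)
grepl("an", words)                # FALSE TRUE FALSE
gsub("a", "A", words)             # "Apple" "bAnAnA" "cherry"
```

Note the difference between length() and nchar(): the first counts elements, the second counts characters inside each element.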

4-2. Basic functions for exploring data

head : show the first 6 rows

tail : show the last 6 rows

summary : show brief descriptive statistics

str : show the structure and data types


3. Operators

* (multiplication) : x*2
/ (division) : x/2
%/% (integer quotient) : 16%/%3 = 5
%% (remainder) : 16%%3 = 1
== (equal, TRUE or FALSE) : 3==5, FALSE
!= (not equal) : 3!=5, TRUE
& (and) : x > 2 & x < 10
| (or) : x < 2 | x > 10

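The operators can be checked directly in the console. This is a minimal sketch in base R; the vector x is a made-up example.

```r
16 %/% 3          # 5  (integer quotient)
16 %% 3           # 1  (remainder)
3 == 5            # FALSE
3 != 5            # TRUE

x <- c(1, 5, 12)  # hypothetical values
x > 2 & x < 10    # FALSE TRUE FALSE
x < 2 | x > 10    # TRUE FALSE TRUE

# Logical results can be used to pick elements, e.g. the even numbers
evens <- x[x %% 2 == 0]
evens             # 12
```

Comparisons on a vector are evaluated element by element, which is what makes filter-style conditions possible later in the course.
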

4-3. Tidyverse

[Reference] https://rstudio.github.io/cheatsheets/html/data-transformation.html

%>% (pipe: passes the data frame on the left into the function on the right) : df %>% head()

filter (keep rows that match a condition) : df %>% filter(column_name == "a")

select (choose specific columns) : df %>% select(column_index) / df[, column_index]

slice (choose specific rows) : df %>% slice(row_index) / df[row_index, ]
mutate (add or modify a column) : df %>% mutate(new_column = expression)
rename (rename a column) : df %>% rename(new_name = old_name)
arrange (sort rows) : ascending : df %>% arrange(column_name), descending : df %>% arrange(desc(column_name))

group_by (group by a column), summarise (compute summary statistics) :

df %>% group_by(group_column) %>% summarise(avg = mean(value_column))
Join tables by a key column (inner_join, full_join, left_join, right_join) : inner_join(df1, df2, by = "name")

separate (split a column at a delimiter) : df %>% separate(column_name, into = c("a", "b"), sep = "_")

Drop rows that contain NA (na.omit) : na.omit(df)

Ignore NA values when computing (na.rm = TRUE) : mean(df$column_name, na.rm = TRUE)

Bind columns (cbind, bind_cols) : cbind(df1, df2) or bind_cols(df1, df2)
Bind rows (rbind, bind_rows) : rbind(df1, df2) or bind_rows(df1, df2)

Find distinct values (distinct) : df %>% distinct(column_name)

Count rows : n(), count()
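
Several of these verbs are usually chained into one pipeline. The following is a minimal sketch, assuming dplyr is installed; the tables scores and info and their columns are hypothetical toy data, not one of the course datasets.

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(
  name  = c("A", "B", "C", "D", "E"),
  group = c("x", "x", "y", "y", "y"),
  score = c(80, 90, 70, NA, 60)
)
info <- tibble(name = c("A", "B", "C"), age = c(31, 25, 40))

# filter -> group_by -> summarise -> arrange in a single chain
by_group <- scores |>
  filter(!is.na(score)) |>                    # drop the row with a missing score
  group_by(group) |>
  summarise(avg = mean(score), n = n()) |>    # one row per group
  arrange(desc(avg))                          # highest average first

by_group

# left_join keeps every row of scores; names missing from info get NA for age
joined <- left_join(scores, info, by = "name")
joined
```

Here by_group has one row for x (average 85) and one for y (average 65), and joined still has all five rows of scores.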

4-4. Long form and wide form

Using the iris data, let's convert the petal length and width and the sepal length and width to long form.

Code
library(tidyverse)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Code
iris |> pivot_longer(cols = Sepal.Length:Petal.Width, names_to = "measure", values_to = "value") |> head()
# A tibble: 6 × 3
  Species measure      value
  <fct>   <chr>        <dbl>
1 setosa  Sepal.Length   5.1
2 setosa  Sepal.Width    3.5
3 setosa  Petal.Length   1.4
4 setosa  Petal.Width    0.2
5 setosa  Sepal.Length   4.9
6 setosa  Sepal.Width    3  
Code
iris |> pivot_longer(cols = Sepal.Length:Petal.Width, 
                     names_to = c("name1", "name2"),
                     names_sep ='\\.') |> head()
# A tibble: 6 × 4
  Species name1 name2  value
  <fct>   <chr> <chr>  <dbl>
1 setosa  Sepal Length   5.1
2 setosa  Sepal Width    3.5
3 setosa  Petal Length   1.4
4 setosa  Petal Width    0.2
5 setosa  Sepal Length   4.9
6 setosa  Sepal Width    3  
Code
iris_long <- 
  iris |> pivot_longer(cols = Sepal.Length:Petal.Width, names_to = "measure", values_to = "value")


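# Species alone does not identify a row, so pivot_wider() returns list-columns;
# unnest() then expands them back into one row per flower.
iris_long |> pivot_wider(
    names_from = measure,  values_from = value) |> 
  unnest(cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)) |> head()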
# A tibble: 6 × 5
  Species Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>          <dbl>       <dbl>        <dbl>       <dbl>
1 setosa           5.1         3.5          1.4         0.2
2 setosa           4.9         3            1.4         0.2
3 setosa           4.7         3.2          1.3         0.2
4 setosa           4.6         3.1          1.5         0.2
5 setosa           5           3.6          1.4         0.2
6 setosa           5.4         3.9          1.7         0.4
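
One way to make the round trip from long back to wide unambiguous is to give every flower its own id before reshaping, so that pivot_wider() can match the four measurements of a flower without list-columns. This is a minimal sketch, assuming dplyr and tidyr are installed; the id column and the names iris_long_id and iris_wide are additions for illustration, not part of the original example.

```r
library(dplyr)
library(tidyr)

# Add a row id so each flower can be identified after reshaping
iris_long_id <- iris |>
  mutate(id = row_number()) |>
  pivot_longer(cols = Sepal.Length:Petal.Width,
               names_to = "measure", values_to = "value")

# With (Species, id) identifying each flower, no list-columns are created
iris_wide <- iris_long_id |>
  pivot_wider(names_from = measure, values_from = value)

dim(iris_wide)   # 150 rows, 6 columns (Species, id, and the four measurements)
```

The trade-off is one extra column; it can be dropped afterwards with select(-id).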


4-4. Exercises

  • Load the palmer penguins data into df and look at the first 6 rows.

    Code
    #install.packages("palmerpenguins")
    library(palmerpenguins)
    df <-  penguins
    head(df)
    # A tibble: 6 × 8
      species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
      <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
    1 Adelie  Torgersen           39.1          18.7               181        3750
    2 Adelie  Torgersen           39.5          17.4               186        3800
    3 Adelie  Torgersen           40.3          18                 195        3250
    4 Adelie  Torgersen           NA            NA                  NA          NA
    5 Adelie  Torgersen           36.7          19.3               193        3450
    6 Adelie  Torgersen           39.3          20.6               190        3650
    # ℹ 2 more variables: sex <fct>, year <int>
  • Explore the data (EDA: use str and summary).

    Code
    str(df)
    tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
     $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
     $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
     $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
     $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
     $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
     $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
     $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
    Code
    summary(df)
          species          island    bill_length_mm  bill_depth_mm  
     Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
     Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
     Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                     Mean   :43.92   Mean   :17.15  
                                     3rd Qu.:48.50   3rd Qu.:18.70  
                                     Max.   :59.60   Max.   :21.50  
                                     NA's   :2       NA's   :2      
     flipper_length_mm  body_mass_g       sex           year     
     Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
     1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
     Median :197.0     Median :4050   NA's  : 11   Median :2008  
     Mean   :200.9     Mean   :4202                Mean   :2008  
     3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
     Max.   :231.0     Max.   :6300                Max.   :2009  
     NA's   :2         NA's   :2                                 
  • Check which columns contain NA.

    Code
    colSums(is.na(df))
              species            island    bill_length_mm     bill_depth_mm 
                    0                 0                 2                 2 
    flipper_length_mm       body_mass_g               sex              year 
                    2                 2                11                 0 
  • Remove _mm from the column names and show 6 rows (use rename).

    Code
    library(tidyverse)
    df |> rename(bill_length = bill_length_mm,
                 bill_depth = bill_depth_mm,
                 flipper_length = flipper_length_mm) |>
      head()
    # A tibble: 6 × 8
      species island   bill_length bill_depth flipper_length body_mass_g sex    year
      <fct>   <fct>          <dbl>      <dbl>          <int>       <int> <fct> <int>
    1 Adelie  Torgers…        39.1       18.7            181        3750 male   2007
    2 Adelie  Torgers…        39.5       17.4            186        3800 fema…  2007
    3 Adelie  Torgers…        40.3       18              195        3250 fema…  2007
    4 Adelie  Torgers…        NA         NA               NA          NA <NA>   2007
    5 Adelie  Torgers…        36.7       19.3            193        3450 fema…  2007
    6 Adelie  Torgers…        39.3       20.6            190        3650 male   2007
  • What is the mean bill length of the Adelie penguins?

    Code
    df |> 
       rename(bill_length = bill_length_mm,
                 bill_depth = bill_depth_mm,
                 flipper_length = flipper_length_mm) |>
      filter(species =="Adelie")  |> 
      summarise("부리길이" = mean(bill_length, na.rm=T))
    # A tibble: 1 × 1
      부리길이
         <dbl>
    1     38.8
  • Find the mean bill length and mean bill depth for each species (to one decimal place).

    Code
    df |> 
       rename(bill_length = bill_length_mm,
                 bill_depth = bill_depth_mm,
                 flipper_length = flipper_length_mm) |>
      group_by(species) %>% summarise("부리길이"=round(mean(bill_length, na.rm=T),1), "부리높이"=round(mean(bill_depth, na.rm=T),1))
    # A tibble: 3 × 3
      species   부리길이 부리높이
      <fct>        <dbl>    <dbl>
    1 Adelie        38.8     18.3
    2 Chinstrap     48.8     18.4
    3 Gentoo        47.5     15  
  • How many penguins are there of each species?

    Code
    df |> 
       rename(bill_length = bill_length_mm,
                 bill_depth = bill_depth_mm,
                 flipper_length = flipper_length_mm) |>
      group_by(species) %>%
      summarise(n=n())
    # A tibble: 3 × 2
      species       n
      <fct>     <int>
    1 Adelie      152
    2 Chinstrap    68
    3 Gentoo      124
  • Select only the species, bill length, and bill depth columns and show 6 rows.

    Code
    df |> 
       rename(bill_length = bill_length_mm,
                 bill_depth = bill_depth_mm,
                 flipper_length = flipper_length_mm) |>
      select(species, bill_length, bill_depth) %>% head()
    # A tibble: 6 × 3
      species bill_length bill_depth
      <fct>         <dbl>      <dbl>
    1 Adelie         39.1       18.7
    2 Adelie         39.5       17.4
    3 Adelie         40.3       18  
    4 Adelie         NA         NA  
    5 Adelie         36.7       19.3
    6 Adelie         39.3       20.6
  • Show rows 10 through 15.

    Code
    df %>% slice(10:15)
    # A tibble: 6 × 8
      species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
      <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
    1 Adelie  Torgersen           42            20.2               190        4250
    2 Adelie  Torgersen           37.8          17.1               186        3300
    3 Adelie  Torgersen           37.8          17.3               180        3700
    4 Adelie  Torgersen           41.1          17.6               182        3200
    5 Adelie  Torgersen           38.6          21.2               191        3800
    6 Adelie  Torgersen           34.6          21.1               198        4400
    # ℹ 2 more variables: sex <fct>, year <int>
  • Create a new variable (bill_ratio = bill_length / bill_depth) using mutate.

    Code
    df |> 
       rename(bill_length = bill_length_mm,
                 bill_depth = bill_depth_mm,
                 flipper_length = flipper_length_mm) |>
      mutate(bill_ratio=bill_length/bill_depth) |> 
      head()
    # A tibble: 6 × 9
      species island   bill_length bill_depth flipper_length body_mass_g sex    year
      <fct>   <fct>          <dbl>      <dbl>          <int>       <int> <fct> <int>
    1 Adelie  Torgers…        39.1       18.7            181        3750 male   2007
    2 Adelie  Torgers…        39.5       17.4            186        3800 fema…  2007
    3 Adelie  Torgers…        40.3       18              195        3250 fema…  2007
    4 Adelie  Torgers…        NA         NA               NA          NA <NA>   2007
    5 Adelie  Torgers…        36.7       19.3            193        3450 fema…  2007
    6 Adelie  Torgers…        39.3       20.6            190        3650 male   2007
    # ℹ 1 more variable: bill_ratio <dbl>
  • For the previous exercise, drop the rows that contain NA and show the result.

    Code
    df |> 
       rename(bill_length = bill_length_mm,
                 bill_depth = bill_depth_mm,
                 flipper_length = flipper_length_mm) |>
      mutate(bill_ratio=bill_length/bill_depth) |> 
      na.omit() %>% head()
    # A tibble: 6 × 9
      species island   bill_length bill_depth flipper_length body_mass_g sex    year
      <fct>   <fct>          <dbl>      <dbl>          <int>       <int> <fct> <int>
    1 Adelie  Torgers…        39.1       18.7            181        3750 male   2007
    2 Adelie  Torgers…        39.5       17.4            186        3800 fema…  2007
    3 Adelie  Torgers…        40.3       18              195        3250 fema…  2007
    4 Adelie  Torgers…        36.7       19.3            193        3450 fema…  2007
    5 Adelie  Torgers…        39.3       20.6            190        3650 male   2007
    6 Adelie  Torgers…        38.9       17.8            181        3625 fema…  2007
    # ℹ 1 more variable: bill_ratio <dbl>
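  • For Adelie and Chinstrap penguins separately, take the 10 individuals with the smallest body_mass, compute the mean bill length (bill_length) of each group, and calculate the difference between the two means.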

    Code
    avg1 <- df |> 
       rename(bill_length = bill_length_mm,
                 bill_depth = bill_depth_mm,
                 flipper_length = flipper_length_mm) |>
      filter(species=="Adelie") |> 
      arrange(body_mass_g) |> 
      slice(1:10) |> 
      summarise(bl=mean(bill_length))
    avg2 <- df |> 
       rename(bill_length = bill_length_mm,
                 bill_depth = bill_depth_mm,
                 flipper_length = flipper_length_mm) |>
      filter(species=="Chinstrap") |> 
      arrange(body_mass_g) |> 
      slice(1:10) |> 
      summarise(bl=mean(bill_length))
    
    result<- abs(avg1$bl-avg2$bl)
    print(result)
    [1] 9.67
  • Find the mode (the most frequent value) of bill length (bill_length).

    Code
    df |> 
       rename(bill_length = bill_length_mm,
                 bill_depth = bill_depth_mm,
                 flipper_length = flipper_length_mm) |>
      select(bill_length) |> 
      table()  -> y
    
    names(y)[which(y==max(y))] 
    [1] "41.1"

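4-5. Homework

Note

Data : gapminder (life expectancy, population, and GDP per capita by year and country)

Load the data with library(gapminder)

Questions
  1. In 2007, how many countries were there on each continent?

  2. For the most recent year, pick the 10 most populous countries and report each country's population and life expectancy. (Express population in units of 100 million people, e.g., 13.2, and life expectancy in whole years, e.g., 73.)

  3. List, in order, the 10 countries whose life expectancy rose fastest per year. (Compare 1952 and 2007.)

  4. What are the mean and standard deviation of GDP per capita by continent in 2002?

  5. Standardize the life expectancy data (mean 0, standard deviation 1).

  6. Excluding Kuwait, normalize the GDP per capita data (rescale it to between 0 and 1).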

Hint

Normalization function :

nor_minmax = function(x){
  result = (x - min(x)) / (max(x) - min(x))
  return(result)
}

Standardization function :

nor_sd = function(x){
  result = (x - mean(x)) / sd(x)
  return(result)
}
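
To see what the two helper functions do, they can be applied to a small vector before using them on gapminder. This is a minimal sketch in base R; the vector v is made up, and the functions are restated here so the block runs on its own.

```r
# Min-max normalization: rescales values to the range 0..1
nor_minmax <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Standardization (z-score): mean 0, standard deviation 1
nor_sd <- function(x) {
  (x - mean(x)) / sd(x)
}

v <- c(2, 4, 6, 8)   # hypothetical values

v_mm <- nor_minmax(v)
v_mm                 # 0, 1/3, 2/3, 1

v_z <- nor_sd(v)
mean(v_z)            # 0 (up to rounding error)
sd(v_z)              # 1
```

Both functions assume x has no NA values; with missing data, min(), max(), mean(), and sd() would each need na.rm = TRUE.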

  1. gapminder |> filter(year == 2007) |> group_by(continent) |> summarise(n= n())
  2. gapminder |> filter(year == 2007) |> arrange(-pop) |> slice(1:10) |> group_by(country) |> summarise(인구수_억명 = round(pop/100000000,1), 기대수명_세 = round(lifeExp) ) |> arrange(-인구수_억명)
  3. gapminder |> select(country, year, lifeExp) |> filter(year %in% c(1952, 2007)) |> pivot_wider(names_from = year, values_from = lifeExp) |> mutate(ratio = (`2007` - `1952`)/(2007-1952)) |> arrange(-ratio) |> slice(1:10)
  4. gapminder |> filter(year == 2002) |> group_by(continent) |> summarise(avg = mean(gdpPercap, na.rm=T), σ= sd(gdpPercap, na.rm=T))
  5. nor_sd = function(x){
       result = (x - mean(x)) / sd(x)
       return(result)
     }

     gapminder |> mutate(life_nor = nor_sd(lifeExp) )

  6. nor_minmax = function(x){
       result = (x - min(x)) / (max(x) - min(x))
       return(result)
     }

     gapminder |> filter(country != "Kuwait") |> mutate(gdp_nor = nor_minmax(gdpPercap) )


Day2 (Homework)

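Below are data preprocessing exercises, given as questions and answers only. Work through them during the week; at the next session (Sat 7/13) we will take turns explaining how each of us solved them.

1 airquality

The airquality dataset

A dataset of Ozone, Solar.R (solar radiation), Wind (wind speed), and Temp (temperature) from May through September.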

Code
library(tidyverse)
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

1. Show how many missing values each column has.

Code
colSums(is.na(airquality))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0 

2. Find the mean of Ozone and the standard deviation of Wind for each month.

Code
airquality |> group_by(Month) |> summarise(Ozone_평균 = mean(Ozone, na.rm=T),  Wind_표준편차 = sd(Wind, na.rm=T))
# A tibble: 5 × 3
  Month Ozone_평균 Wind_표준편차
  <int>      <dbl>         <dbl>
1     5       23.6          3.53
2     6       29.4          3.77
3     7       59.1          3.04
4     8       60.0          3.23
5     9       31.4          3.46

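3. Temp is in Fahrenheit. Create a new column Temp_C with the temperature in Celsius, rounded to one decimal place, and show the month, day, and temperature of the day with the highest Celsius temperature.

Celsius = (Fahrenheit − 32) × 5/9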

Code
airquality |> 
  mutate(Temp_C = round((Temp-32)*5/9,1)) |> 
  arrange(-Temp_C) |> 
  slice(1) |> 
  select(Month, Day, Temp_C)
  Month Day Temp_C
1     8  28   36.1
Code
airquality |> 
  filter(Temp == max(Temp)) |>
  mutate(Temp_C = round((Temp-32)*5/9,1)) |> 
  select(Month, Day, Temp_C)
  Month Day Temp_C
1     8  28   36.1

4. Among the days with Solar.R of 150 or more, how many fall in August or September?

Code
airquality |> 
  filter(Solar.R >= 150) |> 
  filter(Month %in% c(8, 9)) |> 
  count()
   n
1 38

5. For the days on which Ozone is missing, find the median Wind for each month.

Code
airquality |> 
  filter(is.na(Ozone)) |> 
  group_by(Month) |> summarise(Wind_중간값= median(Wind, na.rm=T))
# A tibble: 5 × 2
  Month Wind_중간값
  <int>       <dbl>
1     5        14.3
2     6         9.2
3     7        10.9
4     8        11.5
5     9        13.2

2 diamonds

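The diamonds dataset

price: price in US dollars.

carat: weight of the diamond.

cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal).

color: diamond color, from D (best) to J (worst).

clarity: how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).

x: length (mm).

y: width (mm).

z: depth (mm).

depth: total depth percentage = 100 * z / mean(x, y)

table: width of the top of the diamond relative to its widest point.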

Code
head(diamonds)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

1. Rename the English column names to Korean.

     캐럿 = carat,
     절단 = cut,
     색 = color,
     선명도 =clarity,
     깊이 = depth,
     상단너비 = table,
     가격 = price
Code
diamonds |> 
  rename(캐럿 = carat,
         절단 = cut,
         색 = color,
         선명도 =clarity,
         깊이 = depth,
         상단너비 = table,
         가격 = price)
# A tibble: 53,940 × 10
    캐럿 절단      색    선명도  깊이 상단너비  가격     x     y     z
   <dbl> <ord>     <ord> <ord>  <dbl>    <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2     61.5       55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1     59.8       61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1     56.9       65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2     62.4       58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2     63.3       58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2    62.8       57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1    62.3       57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1     61.9       55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2     65.1       61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1     59.4       61   338  4     4.05  2.39
# ℹ 53,930 more rows

2. Count, for each clarity level, the diamonds whose price is below the mean price.

Code
diamonds |> 
  filter(price < mean(price) ) |> 
  group_by(clarity) |> 
  summarise(n=n())
# A tibble: 8 × 2
  clarity     n
  <ord>   <int>
1 I1        480
2 SI2      4346
3 SI1      7612
4 VS2      7893
5 VS1      5510
6 VVS2     3856
7 VVS1     3098
8 IF       1488

3. Create a new column 깊이백분율 (depth percentage) using the formula below, call head(), and then show only the depth and 깊이백분율 columns. Show 깊이백분율 to one decimal place.

깊이백분율 = z / (mean of x and y) * 100

Code
diamonds |> 
  mutate(깊이백분율 = round(z / ((x+y)/2)*100,1)) |> 

  head() |> 
  select(깊이백분율, depth)
# A tibble: 6 × 2
  깊이백분율 depth
       <dbl> <dbl>
1       61.3  61.5
2       59.8  59.8
3       56.9  56.9
4       62.4  62.4
5       63.3  63.3
6       62.8  62.8

4. For each color, show the mean carat and the median price. Show carat to two decimal places, and sort by median price in descending order.

Code
diamonds |> 
  group_by(color) |> 
  summarise(carat평균 = round(mean(carat),2),
            price중간값 = median(price)) |> 
  arrange(-price중간값)
# A tibble: 7 × 3
  color carat평균 price중간값
  <ord>     <dbl>       <dbl>
1 J          1.16       4234 
2 I          1.03       3730 
3 H          0.91       3460 
4 F          0.74       2344.
5 G          0.77       2242 
6 D          0.66       1838 
7 E          0.66       1739 

5. Among the diamonds whose cut is Premium, what is the price of the one with the largest carat?

Code
diamonds |>
  filter(cut == "Premium") |> 
  filter(carat == max(carat)) |> 
  select(price) |> 
  distinct(price)
# A tibble: 1 × 1
  price
  <int>
1 15223

3 Titanic

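The Titanic dataset

PassengerId: unique ID number given to each passenger
Survived: whether the passenger survived (1) or died (0)
Pclass: passenger class
Name: name
Sex: sex of the passenger
Age: age of the passenger
SibSp: number of siblings/spouses aboard
Parch: number of parents/children aboard
Ticket: ticket number
Fare: amount paid for the ticket
Cabin: cabin category
Embarked: port where the passenger boarded (C = Cherbourg, Q = Queenstown, S = Southampton)

R ships with a built-in, simplified version of the Titanic data. Load it as shown below, store it in the variable titanic, and start from there.

titanic <- as.data.frame(Titanic)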

Code
titanic <- as.data.frame(Titanic)
head(titanic)
  Class    Sex   Age Survived Freq
1   1st   Male Child       No    0
2   2nd   Male Child       No    0
3   3rd   Male Child       No   35
4  Crew   Male Child       No    0
5   1st Female Child       No    0
6   2nd Female Child       No    0

1. How many girls (female children) were on board in total?

Code
titanic |> 
  filter(Sex == "Female" & Age == "Child") |> 
  summarise(n = sum(Freq))
   n
1 45

2. How many adult women were among the Crew?

Code
titanic |> 
  filter(Sex == "Female" & Age == "Adult" & Class == "Crew") |> 
  summarise(n = sum(Freq))
   n
1 23

3. Show the number of survivors by Sex and Age.

Code
titanic |> 
  filter(Survived == "Yes") |> 
  group_by(Sex, Age ) |> 
  summarise(생존자 = sum(Freq))
# A tibble: 4 × 3
# Groups:   Sex [2]
  Sex    Age   생존자
  <fct>  <fct>  <dbl>
1 Male   Child     29
2 Male   Adult    338
3 Female Child     28
4 Female Adult    316

4. Following on from the previous question, what is the survival rate by Sex and Age?

Code
titanic |> 
  group_by(Sex, Age ) |> 
  summarise(인원수 = sum(Freq)) -> titanic1

titanic |> 
  filter(Survived == "Yes") |> 
  group_by(Sex, Age ) |> 
  summarise(생존자 = sum(Freq)) -> titanic2

left_join(titanic1, titanic2) |> 
  mutate(생존율 = round(생존자 / 인원수 * 100))
# A tibble: 4 × 5
# Groups:   Sex [2]
  Sex    Age   인원수 생존자 생존율
  <fct>  <fct>  <dbl>  <dbl>  <dbl>
1 Male   Child     64     29     45
2 Male   Adult   1667    338     20
3 Female Child     45     28     62
4 Female Adult    425    316     74

5. Find the survival rate by Class.

Code
titanic |> 
  group_by(Class) |> 
  summarise(인원수 = sum(Freq)) -> titanic3

titanic |> 
  filter(Survived == "Yes") |> 
  group_by(Class ) |> 
  summarise(생존자 = sum(Freq)) -> titanic4

left_join(titanic3, titanic4) |> 
  mutate(생존율 = round(생존자 / 인원수 * 100))
# A tibble: 4 × 4
  Class 인원수 생존자 생존율
  <fct>  <dbl>  <dbl>  <dbl>
1 1st      325    203     62
2 2nd      285    118     41
3 3rd      706    178     25
4 Crew     885    212     24



4 Working with dates

To learn more about lubridate see https://lubridate.tidyverse.org/.

  • Install and load the package
Code
#install.packages('lubridate')
library('lubridate')
  • Convert a date stored as text into a Date variable
Code
date <- '2020-01-10'
class(date)
[1] "character"
Code
date2 <- as.Date(date)
class(date2)
[1] "Date"
  • Extract the year, month, and day
Code
year(date)
[1] 2020
Code
month(date)
[1] 1
Code
day(date)
[1] 10
Code
ymd(date)
[1] "2020-01-10"
  • Extract the week and the day of the week
Code
week(date)
[1] 2
Code
wday(date)
[1] 6
Code
wday(date, label = T)
[1] 금
Levels: 일 < 월 < 화 < 수 < 목 < 금 < 토
  • Extract the hour, minute, and second
Code
now()
[1] "2024-07-19 23:51:12 KST"
Code
time <- now()
hour(time)
[1] 23
Code
minute(time)
[1] 51
Code
second(time)
[1] 12.24399
Code
ymd_hms(time)
[1] "2024-07-19 23:51:12 UTC"
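
The parsing helpers can also read dates written in other orders, and a time zone can be given explicitly. This is a minimal sketch, assuming lubridate is installed; the date strings and the Asia/Seoul time zone are illustrative choices. ymd_hms() assumes UTC when no tz is supplied, which is consistent with the UTC label in the last output above.

```r
library(lubridate)

# The function name states the order of the parts in the text
d1 <- ymd("2020-01-10")
d2 <- mdy("01/10/2020")
d3 <- dmy("10-01-2020")
d1 == d2 && d2 == d3        # TRUE: all three are the same date

# Give the time zone explicitly when parsing a local time
t_kst <- ymd_hms("2024-07-19 23:51:12", tz = "Asia/Seoul")
hour(t_kst)                 # 23

# The same instant shown in UTC (Korea is 9 hours ahead of UTC)
t_utc <- with_tz(t_kst, "UTC")
hour(t_utc)                 # 14
```

with_tz() changes only how the instant is displayed; the underlying point in time stays the same.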

5 Rainfall analysis

[Source] Week 1 practice problems (practical exam type 1) (이기적 스터디 카페)

dataurl = https://raw.githubusercontent.com/Datamanim/datarepo/main/weather/weather2.csv

  • Load the packages and read the data
Code
library(tidyverse)

df<-read.csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/weather/weather2.csv")

  • Q1. In summer (June, July, August), in how many hourly records is the temperature in 이화동 higher than in 수영동?
Code
#Q1

library(lubridate)

df |> 
  mutate(월 = month(time),
         시간 = hour(time)) |> 
  filter(월 %in% c(6,7,8),
         이화동기온 > 수영동기온) |> 
  nrow()
[1] 1415
  • Q2. Find the time of maximum rainfall for 이화동 and for 수영동.
Code
#Q2

df |> 
  filter(이화동강수 == max(이화동강수 ) ) |> 
  select(time)
                 time
1 2020-09-30 09:00:00
Code
df |> 
  filter(수영동강수 == max(수영동강수)) |> 
  select(time)
                 time
1 2020-07-23 12:00:00

Loading data

To learn more about tidyr see https://tidyr.tidyverse.org/reference/pivot_longer.html/.

The first step in data analysis is loading the data.

  1. Load from R's built-in datasets

    data() , help("AirPassengers")

https://vincentarelbundock.github.io/Rdatasets/datasets.html

Code
data(AirPassengers)
AirPassengers
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
Code
plot(AirPassengers, main = "Airline Passengers Over Time",
     xlab = "Year-Month", ylab = "Number of Passengers")

  2. Load external data (install the package, then load it with library)

    gapminder : contains population, economic, and health data for countries around the world

Code
#install.packages("gapminder")
library(gapminder)

data(gapminder)
head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
  3. Paste from the clipboard (Excel)

    Install the datapasta package -> press Ctrl+C in Excel -> choose Paste as tribble from the RStudio Addins menu

  4. Load from a CSV file
Code
#| eval: true

 #  read.csv("D:/r/data/test.csv")       ## use forward slashes (/) in the path
 #  read.csv("D:\\r\\data\\test.csv")    ## or doubled backslashes (\\)
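
Because the paths above exist only on the author's machine, reading a CSV can be tried with a temporary file instead. This is a minimal sketch in base R; the data frame sample_df and its columns are made up for illustration.

```r
# Write a small hypothetical table to a temporary CSV file
path <- tempfile(fileext = ".csv")
sample_df <- data.frame(id = 1:3, value = c(1.5, 2.5, NA))
write.csv(sample_df, path, row.names = FALSE)

# Read it back; the empty cell is read as NA
loaded <- read.csv(path)
str(loaded)
colSums(is.na(loaded))   # id: 0, value: 1
```

On Windows, file.path("D:", "r", "data", "test.csv") builds a path with forward slashes, which avoids the backslash issue altogether.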
  5. Load an Excel file https://readxl.tidyverse.org/
Code
#| eval: true

#   install.packages('readxl')
#   library(readxl)
#   read_excel("my_file.xls")
  6. Load from Google Sheets

[Reference] https://googlesheets4.tidyverse.org/

Code
#install.packages("googlesheets4")
library(googlesheets4)
gs4_deauth()

df <- read_sheet("https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w/edit?gid=0#gid=0")

head(df)
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE      
         <dbl> <chr>            <chr>              <chr>               <list>   
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          <dbl [1]>
2            2 Barclay Lynn     French fries       Lunch only          <dbl [1]>
3            3 Jayendra Lyne    N/A                Breakfast and lunch <dbl [1]>
4            4 Leon Rossini     Anchovies          Lunch only          <NULL>   
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch <chr [1]>
6            6 Güvenç Attila    Ice cream          Lunch only          <dbl [1]>
  7. Handle NA values https://tidyr.tidyverse.org/reference/fill.html
Code
sales <- tibble::tribble(
  ~quarter, ~year, ~sales,
  "Q1",    2000,    66013,
  "Q2",      NA,    69182,
  "Q3",      NA,    53175,
  "Q4",      NA,    21001,
  "Q1",    2001,    46036,
  "Q2",      NA,    58842,
  "Q3",      NA,    44568,
  "Q4",      NA,    50197,
  "Q1",    2002,    39113,
  "Q2",      NA,    41668,
  "Q3",      NA,    30144,
  "Q4",      NA,    52897,
  "Q1",    2004,    32129,
  "Q2",      NA,    67686,
  "Q3",      NA,    31768,
  "Q4",      NA,    49094
)


# `fill()` defaults to replacing missing data from top to bottom
sales %>% fill(year, .direction = "down")
# A tibble: 16 × 3
   quarter  year sales
   <chr>   <dbl> <dbl>
 1 Q1       2000 66013
 2 Q2       2000 69182
 3 Q3       2000 53175
 4 Q4       2000 21001
 5 Q1       2001 46036
 6 Q2       2001 58842
 7 Q3       2001 44568
 8 Q4       2001 50197
 9 Q1       2002 39113
10 Q2       2002 41668
11 Q3       2002 30144
12 Q4       2002 52897
13 Q1       2004 32129
14 Q2       2004 67686
15 Q3       2004 31768
16 Q4       2004 49094
  8. Replace NA with the mean or the median
Code
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
Code
colSums(is.na(airquality))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0 
Code
# Ozone: replace NA with the column mean; Solar.R: replace NA with its own median
airquality |> 
  mutate(Ozone = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone),
         Solar.R = ifelse(is.na(Solar.R), median(Solar.R, na.rm = TRUE), Solar.R)) -> airquality2

colSums(is.na(airquality2))
  Ozone Solar.R    Wind    Temp   Month     Day 
      0       0       0       0       0       0 
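When several columns need the same treatment, the replacement can be written once with `across()`. This is a sketch that imputes both columns with their own median; `airquality_med` is a name chosen here for illustration.

```r
library(dplyr)

# Replace every NA in Ozone and Solar.R with that column's own median.
# across() applies the same function to each selected column in turn.
airquality_med <- airquality |>
  mutate(across(c(Ozone, Solar.R),
                \(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)))

# Count the remaining NA values per column
colSums(is.na(airquality_med))
```

Whether the mean or the median is the better choice depends on the data; the median is less sensitive to outliers.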

Visualization

[ggplot gallery] The R Graph Gallery – Help and inspiration for R charts (r-graph-gallery.com)

[Korea R User Group – Data visualization with ChatGPT (r2bit.com)] https://r2bit.com/bitSlide/chatgpt_viz_202406.html#/데이터-시각화

[Reference] https://waterfirst.quarto.pub/r_course/#/title-slide

[Reference] https://rstudio.github.io/cheatsheets/html/data-visualization.html

Code
df <- tibble::tribble(
  ~angle,  ~`4.3`,  ~`3.8`,  ~`3.3`,  ~`2.8`,  ~`2.3`,  ~`1.8`,  ~`1.3`,
      0L,   0.999,   0.999,       1,       1,       1,       1,       1,
      5L,       1,       1,   0.999,   0.988,   0.963,   0.923,    0.88,
     10L,    0.91,   0.866,   0.821,   0.774,    0.73,   0.685,    0.64,
     15L,   0.668,   0.621,   0.577,   0.533,    0.49,   0.449,   0.407,
     20L,   0.424,   0.382,   0.339,   0.294,   0.252,   0.207,   0.162,
     25L,   0.182,   0.139,   0.096,   0.056,   0.028,   0.014,   0.011,
     30L,   0.011,    0.01,    0.01,   0.009,   0.009,    0.01,   0.009,
     35L,   0.008,   0.008,   0.008,   0.008,   0.008,   0.008,   0.008,
     40L,   0.007,   0.007,   0.007,   0.007,   0.007,   0.007,   0.007,
     45L,   0.006,   0.006,   0.005,   0.006,   0.005,   0.003,   0.002,
     50L,   0.005,   0.005,   0.004,   0.003,   0.002,   0.001,   0.001,
     55L,   0.006,   0.003,   0.002,   0.001,   0.001,   0.001,       0,
     60L,   0.005,   0.002,   0.001,   0.001,   0.001,       0,       0,
     65L,   0.004,   0.003,   0.001,   0.001,       0,       0,       0,
     70L,   0.003,   0.002,   0.002,       0,       0,       0,       0
  )
head(df)
# A tibble: 6 × 8
  angle `4.3` `3.8` `3.3` `2.8` `2.3` `1.8` `1.3`
  <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     0 0.999 0.999 1     1     1     1     1    
2     5 1     1     0.999 0.988 0.963 0.923 0.88 
3    10 0.91  0.866 0.821 0.774 0.73  0.685 0.64 
4    15 0.668 0.621 0.577 0.533 0.49  0.449 0.407
5    20 0.424 0.382 0.339 0.294 0.252 0.207 0.162
6    25 0.182 0.139 0.096 0.056 0.028 0.014 0.011

1. Scatter plot

Code
library(tidyverse)

df %>% pivot_longer(-1, names_to = "space", values_to = "value") %>% 
  
  ggplot(aes(x=angle, y=value, col=space))+
  geom_point()+
  geom_smooth(se=F, method = "gam")+
  theme_bw()

Code
df %>% pivot_longer(-1, names_to = "space", values_to = "value") %>% 
  
  ggplot(aes(x=angle, y=value, col=space))+
  geom_point()+
  geom_smooth(se=F, method = "gam")+
  theme_bw()+
  facet_wrap(~space, labeller = label_both)

2. Bar chart

Code
df %>% pivot_longer(-1, names_to = "space", values_to = "value") %>%  
  mutate(space = as.numeric(space)) %>% 
  filter(angle %in% c(0, 10, 15)) %>% 
  mutate(angle = as.factor(angle)) %>% 
  ggplot(aes(x=space, y=value*100, label=value*100, fill=angle))+
  geom_col(position="dodge")+
  
  geom_text(aes(label = value*100, y=value*100+3), position = position_dodge(0.5))+
  theme_bw()+
  labs(y="normalized value(%)", x="space (separation distance)")

Code
df %>% pivot_longer(-1, names_to = "space", values_to = "value") %>%  
  mutate(space = as.numeric(space)) %>% 
  filter(angle %in% c(0, 10, 15)) %>% 
  mutate(angle = as.factor(angle)) %>% 
  ggplot(aes(x=space, y=value*100, label=value*100, fill=angle))+
  geom_col(position="dodge")+
  geom_label(position= position_dodge(0.4))+
  theme_bw()+
  labs(y="normalized value(%)", x="space (separation distance)")

3. Box plot

Code
df %>% pivot_longer(-1, names_to = "space", values_to = "value") %>%  
  mutate(space = as.factor(space)) %>% 
  filter(angle >45) %>% 
  ggplot(aes(x=space, y=value*100,  fill=space))+
  geom_boxplot()+
  theme_bw()+
  labs(y="normalized value(%)", x="space (separation distance)")

  • Adding the mean value
Code
p <- df %>% pivot_longer(-1, names_to = "space", values_to = "value") %>%  
  mutate(space = as.factor(space)) %>% 
  filter(angle <25) %>% 
  ggplot(aes(x=space, y=value*100,  fill=space))+
  geom_boxplot()+
  theme_bw()+
  labs(y="normalized value(%)", x="space (separation distance)")

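# Return the mean (used as the y position) and a rounded label for each box
fun_mean <- function(x){
  m <- mean(x, na.rm = TRUE)
  return(data.frame(y = m, label = round(m, 1)))}

p+
  stat_summary(fun.data = fun_mean, geom="text", vjust=-0.7, position=position_dodge(0.8))+
  stat_summary(fun = mean, geom="point", size=1)  # `fun.y` is deprecated since ggplot2 3.3.0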

[Reference] https://ggplot2.tidyverse.org/reference/geom_boxplot.html

4. Color palettes

Code
library(RColorBrewer)
display.brewer.all()

Usage:

scale_fill_brewer(palette = "Set1")

scale_colour_brewer(palette = "Set1")
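As a minimal sketch of how these scales attach to a plot, the example below fills a box plot of the built-in `iris` data with the "Set1" palette. The dataset and the object name `p_set1` are choices made for this illustration.

```r
library(ggplot2)

# Box plot of a built-in dataset, filled with the "Set1" Brewer palette
p_set1 <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set1") +
  theme_bw()

p_set1
```

Use `scale_fill_brewer()` when the grouping variable is mapped to `fill` (bars, boxes) and `scale_colour_brewer()` when it is mapped to `colour` (points, lines).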

[Color Pick Up](https://r-graph-gallery.com/ggplot2-color.html)

[Colorspace package](https://m.blog.naver.com/regenesis90/222234511150)

[Sci-Fi](https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html)

5. Themes

[theme](https://ggplot2.tidyverse.org/reference/ggtheme.html)

Day3 (Homework)

Violin plot

[Reference] https://r-charts.com/es/distribucion/grafico-violin-grupo-ggplot2/

Code
# install.packages("ggplot2")
library(tidyverse)
head(warpbreaks)
  breaks wool tension
1     26    A       L
2     30    A       L
3     54    A       L
4     25    A       L
5     70    A       L
6     52    A       L
Code
str(warpbreaks)
'data.frame':   54 obs. of  3 variables:
 $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
 $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
Code
summary(warpbreaks)
     breaks      wool   tension
 Min.   :10.00   A:27   L:18   
 1st Qu.:18.25   B:27   M:18   
 Median :26.00          H:18   
 Mean   :28.15                 
 3rd Qu.:34.00                 
 Max.   :70.00                 
Code
warpbreaks |> ggplot(aes(x = tension, y = breaks, fill = tension)) +
  geom_violin(trim = F) +
  geom_boxplot(width = 0.07) 

Density plot

[Reference] https://r-charts.com/es/distribucion/grafico-densidad-grupo-ggplot2/

Code
# Data
set.seed(5)
x <- c(rnorm(200, mean = -2, 1.5),
       rnorm(200, mean = 0, sd = 1),
       rnorm(200, mean = 2, 1.5))
group <- c(rep("A", 200), rep("B", 200), rep("C", 200))
df <- data.frame(x, group)

head(df)
            x group
1 -3.26128322     A
2  0.07653902     A
3 -3.88323779     A
4 -1.89478585     A
5  0.56716131     A
6 -2.90436197     A
Code
# install.packages("ggplot2")
library(ggplot2)

cols <- c("#F76D5E", "#FFFFBF", "#72D8FF")

# Density plot in ggplot2
df |> ggplot(aes(x = x, fill = group)) +
  geom_density(alpha = 0.7) + 
  scale_fill_manual(values = cols) 

boxplot+density+point

[Reference] https://mjskay.github.io/ggdist/

Code
library(ggdist)
library(tidyverse)
library(tidyquant)

head(mpg)
# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
Code
mpg %>% 
  filter(cyl %in% c(4,6,8)) %>% 
  ggplot(aes(x=factor(cyl), y=hwy, fill=factor(cyl)))+
  ggdist::stat_halfeye(
    adjust=0.5,
    justification= -.2,
    .width = 0,
    width=0.4,
    point_colour=NA
  )+
  ggdist::stat_dots(
    side="left",
    justification = 1.1,
    binwidth = .25
  )+
  scale_fill_tq()+
  theme_tq()+
  labs(title="Raincloud Plot",
       subtitle = "showing the bi-modal distribution of 6 cylinder vehicle",
       x="engine size",
       y="highway fuel economy",
       fill= "cylinders")+
  #coord_flip()+
  geom_boxplot(
    width=.12,
    outlier.color = NA,
    alpha=0.5
  )

Pairs plot

[Reference] https://r-charts.com/es/correlacion/ggpairs/

Code
# install.packages("GGally")
library(GGally)

ggpairs(iris)  

Code
# install.packages("GGally")
library(GGally)

ggpairs(iris, columns = 1:4, aes(color = Species, alpha = 0.5),
        upper = list(continuous = "points")) 

Sankey diagram

[Reference] https://r-charts.com/es/flujo/diagrama-sankey-ggplot2/

Code
# install.packages("remotes")
# remotes::install_github("davidsjoberg/ggsankey")
library(ggsankey)
library(ggplot2)
library(dplyr) # required for %>%

# Reshape mtcars into the long format expected by geom_sankey()
df <- mtcars %>%
  make_long(cyl, vs, am, gear, carb) 

ggplot(df, aes(x = x, 
               next_x = next_x, 
               node = node, 
               next_node = next_node,
               fill = factor(node),
               label = node)) +
  geom_sankey(flow.alpha = 0.5, node.color = 1) +
  geom_sankey_label(size = 3.5, color = 1, fill = "white") +
  scale_fill_viridis_d() +
  theme_sankey(base_size = 16) +
  theme(legend.position = "none") 

Splitting a plot into panels

  • facet_grid
Code
#create data frame
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 position=c('G', 'G', 'F', 'F', 'G', 'G', 'G', 'G'),
                 points=c(8, 14, 20, 22, 25, 29, 30, 31),
                 assists=c(10, 5, 5, 3, 8, 6, 9, 12))

ggplot(df, aes(assists, points)) +
  geom_point() +
  facet_grid(position~team)

  • facet_wrap
Code
ggplot(df, aes(assists, points)) +
  geom_point() +
  facet_wrap(position~team)
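The two functions differ when some combinations are missing from the data. The sketch below rebuilds the same toy data under a new name (`df_facet`, chosen here to avoid overwriting `df`) and counts the panels each layout creates.

```r
library(ggplot2)

# Same toy data as above: team B has no forwards (position "F")
df_facet <- data.frame(
  team     = c("A", "A", "A", "A", "B", "B", "B", "B"),
  position = c("G", "G", "F", "F", "G", "G", "G", "G"),
  points   = c(8, 14, 20, 22, 25, 29, 30, 31),
  assists  = c(10, 5, 5, 3, 8, 6, 9, 12)
)

base_plot <- ggplot(df_facet, aes(assists, points)) + geom_point()

p_grid <- base_plot + facet_grid(position ~ team)   # full rows x columns grid
p_wrap <- base_plot + facet_wrap(position ~ team)   # only combinations that occur

# Number of panels each layout produces
n_grid <- nrow(ggplot_build(p_grid)$layout$layout)
n_wrap <- nrow(ggplot_build(p_wrap)$layout$layout)
c(grid = n_grid, wrap = n_wrap)
```

`facet_grid()` keeps an empty panel for the F/B combination so that rows and columns stay aligned, while `facet_wrap()` draws only the three combinations present in the data.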

The patchwork package

Code
library(patchwork)

p1 <- ggplot(mtcars) + geom_point(aes(mpg, disp))
p2 <- ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))

p1 + p2

Code
p3 <- ggplot(mtcars) + geom_smooth(aes(disp, qsec))
p4 <- ggplot(mtcars) + geom_bar(aes(carb))

(p1 | p2 | p3) /
      p4
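Beyond `+`, `|` and `/`, patchwork offers `plot_layout()` and `plot_annotation()` to adjust relative sizes and add shared titles or panel tags. This is a small sketch with its own plot objects (`pa`, `pb`, `pc` are names assumed here so that the plots above are not overwritten).

```r
library(ggplot2)
library(patchwork)

pa <- ggplot(mtcars) + geom_point(aes(mpg, disp))
pb <- ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))
pc <- ggplot(mtcars) + geom_bar(aes(carb))

# Two plots on top, one below; make the top row twice as tall,
# then add a shared title and automatic panel tags (A, B, C)
combined <- (pa | pb) / pc +
  plot_layout(heights = c(2, 1)) +
  plot_annotation(title = "mtcars overview", tag_levels = "A")

combined
```

Panel tags are convenient for papers, where figure captions refer to sub-panels as (A), (B), and so on.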

Publication-ready theme

Code
theme_Publication <- function(base_size=14, base_family="helvetica") {
  library(grid)
  library(ggthemes)
  (theme_foundation(base_size=base_size, base_family=base_family)
    + theme(plot.title = element_text(face = "bold",
                                      size = rel(1.2), hjust = 0.5),
            text = element_text(),
            panel.background = element_rect(colour = NA),
            plot.background = element_rect(colour = NA),
            panel.border = element_rect(colour = NA),
            axis.title = element_text(face = "bold",size = rel(1)),
            axis.title.y = element_text(angle=90,vjust =2),
            axis.title.x = element_text(vjust = -0.2),
            axis.text = element_text(), 
            axis.line = element_line(colour="black"),
            axis.ticks = element_line(),
            panel.grid.major = element_line(colour="#f0f0f0"),
            panel.grid.minor = element_blank(),
            legend.key = element_rect(colour = NA),
            legend.position = "bottom",
            legend.direction = "horizontal",
            legend.key.size= unit(0.2, "cm"),
            legend.spacing = unit(0, "cm"),  # a unit passed to legend.margin is treated as legend.spacing in current ggplot2
            legend.title = element_text(face="italic"),
            plot.margin=unit(c(10,5,5,5),"mm"),
            strip.background=element_rect(colour="#f0f0f0",fill="#f0f0f0"),
            strip.text = element_text(face="bold")
    ))
  
}

scale_fill_Publication <- function(...){
  library(scales)
  discrete_scale("fill","Publication",manual_pal(values = c("#386cb0","#fdb462","#7fc97f","#ef3b2c","#662506","#a6cee3","#fb9a99","#984ea3","#ffff33")), ...)
  
}

scale_colour_Publication <- function(...){
  library(scales)
  discrete_scale("colour","Publication",manual_pal(values = c("#386cb0","#fdb462","#7fc97f","#ef3b2c","#662506","#a6cee3","#fb9a99","#984ea3","#ffff33")), ...)
  
}

Plotly plots

Code
library(ggrepel)


# Capital expenditure by Australian state, 2003-2014 (12 yearly values per state)
temp.dat <- data.frame(
  Year  = rep(as.character(2003:2014), times = 4),
  State = factor(rep(c("VIC", "NSW", "QLD", "WA"), each = 12),
                 levels = c("VIC", "NSW", "QLD", "WA")),
  Capex = c(
    # VIC
    5.35641472365348, 5.76523240652641, 5.24727577535625, 5.57988239709746,
    5.14246402568366, 4.96786288162828, 5.493190785287, 6.08500616799372,
    6.5092228474591, 7.03813541623157, 8.34736513875897, 9.04992300432169,
    # NSW
    7.15830329914056, 7.21247045701994, 7.81373928617117, 7.76610217197542,
    7.9744994967006, 7.93734452080786, 8.29289899132255, 7.85222269563982,
    8.12683746325074, 8.61903784301649, 9.7904327253813, 9.75021175267288,
    # QLD
    8.2950673974226, 6.6272705639724, 6.50170524635367, 6.15609626379471,
    6.43799637295979, 6.9869551384028, 8.36305663640294, 8.31382617231745,
    8.65409824343971, 9.70529678167458, 11.3102788081848, 11.8696420977237,
    # WA
    6.77937303542605, 5.51242844820827, 5.35789621712839, 4.38699327451101,
    4.4925792218211, 4.29934654081527, 4.54639175257732, 4.70040615159951,
    5.04056109514957, 5.49921208937735, 5.96590909090909, 6.18700407463007
  ),
  stringsAsFactors = FALSE
)


head(temp.dat)
  Year State    Capex
1 2003   VIC 5.356415
2 2004   VIC 5.765232
3 2005   VIC 5.247276
4 2006   VIC 5.579882
5 2007   VIC 5.142464
6 2008   VIC 4.967863
Code
library(ggplot2)
library(ggrepel)
library(dplyr)


p <- temp.dat %>%
  mutate(label = if_else(Year == max(Year), as.character(State), NA_character_)) %>%
  ggplot(aes(x = Year, y = Capex, group = State, colour = State, shape=State)) + 
  geom_line() + geom_point()+
  geom_label_repel(aes(label = label),
                   nudge_x = 1,
                   na.rm = TRUE)+
  scale_colour_Publication()+ theme_Publication()+
  theme(legend.position = "none")
p

Code
library(plotly)
ggplotly(p)

Saving plots

Code
p

Code
ggsave("myplot.png", plot = p)  # name the plot explicitly instead of relying on the last plot drawn

p2 <- ggplotly(p)
htmlwidgets::saveWidget(p2, "myplot.html")
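For figures that go into a report or paper it usually helps to fix the output size explicitly. This is a minimal sketch, assuming a simple demo plot (`p_demo`) and a temporary output path (`out_file`), both introduced here for illustration.

```r
library(ggplot2)

p_demo <- ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_bw()

# Write to a temporary folder so the example leaves no files behind;
# replace out_file with a real path such as "figures/myplot.pdf"
out_file <- file.path(tempdir(), "myplot_demo.pdf")

# Passing plot = explicitly avoids saving whatever was drawn last by accident
ggsave(out_file, plot = p_demo, width = 6, height = 4, units = "in")

file.exists(out_file)
```

`ggsave()` picks the file format from the extension; for raster formats such as PNG the `dpi` argument controls the resolution.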

Practical tips

Note

Turning x-y-z data into an image, usually displayed as a gradient (heat) map

Code
df <- read.csv("https://raw.githubusercontent.com/waterfirst/Data_visualization/main/xyz.csv", sep="\t")


head(df)
    X          V1         V21          V41         V61        V81        V101
1   1 -1337.41406 -1367.09180 -1401.021484 -1435.35352  -255.8105    90.50391
2  21 -1170.75586 -1201.98828 -1241.388672  -187.56445   141.4883   122.69531
3  41 -1235.54102 -1254.79102  -203.886719   138.48438   125.3984 -1333.99414
4  61 -1461.77930  -289.81445    70.949219    41.79688 -1516.9668 -1492.18555
5  81  -313.12109    42.24219     3.839844 -1542.40820 -1542.9258 -1521.15430
6 101    42.46094  -316.14258 -1498.402344 -1482.32227 -1480.4180 -1463.98828
       V121      V141      V161      V181      V201      V221        V241
1    52.875 -1472.014 -1518.787 -1536.834 -1533.326 -1101.887    24.87109
2 -1331.297 -1360.314 -1404.381 -1429.262 -1427.475 -1422.768  -844.32227
3 -1339.826 -1327.027 -1364.625 -1379.879 -1373.289 -1358.213 -1392.21875
4 -1496.389 -1481.912 -1507.918 -1497.352 -1502.889 -1497.029 -1520.81445
5 -1521.072 -1511.488 -1535.635 -1532.748 -1534.363 -1532.615 -1544.00586
6 -1462.914 -1459.406 -1483.430 -1494.395 -1506.535 -1501.504  -323.27930
         V261       V281        V301        V321        V341        V361
1    26.28125 -954.02734 -1470.94922 -1425.98047 -1410.35352 -1399.28906
2    58.83984  -60.17188  -691.29102 -1253.54492 -1267.41016 -1221.70703
3  -682.13086   42.39453    74.14453  -949.02539 -1287.09180 -1240.88867
4 -1507.81836 -316.11719  1159.39844    -3.96875  -999.91992 -1454.92383
5  -363.40039  -65.80078  -340.77734   -66.72266    13.09375 -1251.93945
6    17.94531 -349.52344 -1479.83203  -572.57812    34.05859    13.52734
        V381         V401         V421         V441        V461        V481
1 -1407.7930 -1426.691406  -816.183594   170.671875 -1530.59766  -470.96875
2 -1210.6816 -1226.542969 -1259.244141  -188.871094    68.32031   163.68359
3 -1235.8965 -1252.113281 -1287.478516     5.054688    47.89844   -96.00391
4 -1479.8750 -1498.388672  -248.117188    27.949219  -215.24023 -1430.28516
5 -1553.6367  -490.083984     1.609375  -330.289063 -1484.54688 -1606.51172
6   -12.1875    -5.667969  -317.759766 -1400.449219 -1521.87109 -1549.57812
        V501        V521        V541         V561        V581       V601
1   518.8125  -323.81641 -1429.90430 -1575.992188 -1598.43945 -1593.9297
2  1204.3359  -338.88281 -1476.27539 -1505.312500 -1532.32031 -1530.1387
3  -295.0254    42.44922  -295.01367 -1397.550781 -1421.82617 -1413.4121
4 -1545.0332  -351.15039    18.14062  -363.304688 -1552.30859 -1561.4160
5 -1582.2793 -1570.34961  -361.78711    -9.164063  -407.02734 -1598.2090
6 -1496.2090 -1485.60547 -1494.91797   -23.050781   -16.51172  -407.9746
       V621        V641         V661        V681        V701       V721
1 -1025.338   -21.57812   -28.421875  -968.12109 -1570.11523  125.16797
2 -1520.510  -665.84766    -1.128906  -114.89844    39.73828 -585.83008
3 -1395.627 -1428.10156  -295.171875   -11.97266  -224.63281 -788.78320
4 -1547.758 -1575.64648 -1580.324219  -728.16406   -97.46875 -313.14844
5 -1593.164 -1618.70508 -1618.599609 -1650.11133  -750.56445  -51.80469
6 -1522.412 -1556.14062 -1548.394531 -1548.96094 -1517.97852 -404.95898
          V741       V761       V781
1  -866.136719 -1474.8105 -1486.8203
2 -1320.425781 -1287.2891 -1273.5488
3 -1338.429688 -1294.9023 -1274.4980
4 -1056.027344 -1485.5293 -1505.5762
5  -221.890625 -1385.7188 -1605.2246
6    -1.902344  -244.5625  -421.5684
Code
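# Pivot all value columns but not the position column X (cols = 1:40 would include X and drop V781)
df |> pivot_longer(cols = -X, names_to = "x", values_to = "z") |>
  mutate(y = rep(seq_len(ncol(df) - 1), times = nrow(df)),
         x = rep(seq_len(nrow(df)), each = ncol(df) - 1), z = (z - min(z)) / 10000) -> df1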
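The reshaping step is easier to follow on a tiny table. The sketch below uses a made-up 3 × 4 grid; the `wide` data frame is hypothetical, and positions are recovered from the `X` column and the `V…` column names rather than from running indices.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical 3 x 4 height map in the same wide layout:
# first column = row position, remaining columns = one per column position
wide <- data.frame(
  X  = c(1, 2, 3),
  V1 = c(0.1, 0.2, 0.3),
  V2 = c(0.4, 0.5, 0.6),
  V3 = c(0.7, 0.8, 0.9),
  V4 = c(1.0, 1.1, 1.2)
)

long <- wide |>
  pivot_longer(cols = -X, names_to = "col", values_to = "z") |>
  mutate(x = X,                                  # row position
         y = as.numeric(sub("^V", "", col)))     # column position taken from the name

nrow(long)   # one row per cell of the grid

# Each (x, y) cell becomes one tile colored by z
ggplot(long, aes(x = x, y = y, fill = z)) + geom_tile()
```

After pivoting, every cell of the original grid is one row with its own `x`, `y` and `z`, which is the shape `geom_tile()` and `geom_raster()` expect.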
Code
df1 %>% ggplot(aes(x=x, y=y, fill=z))+geom_tile()+
  scale_fill_gradientn(colours=c("navy","blue", "green", "yellow", "orange", "red"))+
  labs(x="x", y="y", title=paste("3D Profile by WSI")) +
  theme_bw() + theme(axis.text.x=element_text(size=9, angle=0, vjust=0.3),
                     axis.text.y=element_text(size=9),
                     plot.title=element_text(size=11))

Code
df1 |>  ggplot(aes(x=x, y=y, fill=z)) +
  geom_raster()+
    scale_fill_viridis_c()

Code
require(akima)
require(rgl)


filled.contour(x=c(1:nrow(df)),
               y=c(1:ncol(df)),
               z=as.matrix(df),
               color.palette=colorRampPalette(c("blue","yellow","red")),
               plot.title=title(main="mm" ,
                                sub= ""  ,
                                xlab="", ylab=""),
               nlevels=50,
               plot.axes = { axis(side = 2, at = nrow(df), labels = "", col.lab="white")
                 axis(side = 1, at = ncol(df), labels = ncol(df), col.lab="white") },
               key.title=title(main="T(%)"),
               key.axes = axis(4, seq(0, 8, by = 0.1))) 

3D graph

https://www.rayshader.com/reference/plot_gg.html

Code
library(rayshader)
mtplot_density = ggplot(mtcars) + 
 stat_density_2d(aes(x=mpg,y=disp, fill=after_stat(!!str2lang("density"))), 
                 geom = "raster", contour = FALSE) +
 scale_x_continuous(expand=c(0,0)) +
 scale_y_continuous(expand=c(0,0)) +
 scale_fill_gradient(low="pink", high="red")
mtplot_density

Code
plot_gg(mtplot_density, width = 4,zoom = 0.60, theta = -45, phi = 30, 
       windowsize = c(1400,866))
render_snapshot()

Polar contour graph

Code
df <- read.csv("./data/polar_data.csv")
head(df)
  Theta        X0       X10       X20      X30       X40      X50       X60
1     0 149.66808 149.66808 149.66808 149.6681 149.66808 149.6681 149.66808
2    10 149.95407 149.91832 150.61545 150.0434 149.85576 150.6601 150.55289
3    20 148.55981 148.56876 149.07819 149.3999 150.59758 151.8131 152.48340
4    30 136.09195 134.68875 132.22198 140.9897 142.48228 142.2946 144.97586
5    40 104.89992 105.52555 103.47885 109.1274 110.43226 111.3439 111.27239
6    50  70.18652  71.83996  72.85884  75.3971  75.33453  76.3981  76.78242
        X70       X80       X90      X100     X110      X120     X130      X140
1 149.66808 149.66808 149.66808 149.66808 149.6681 149.66808 149.6681 149.66808
2 150.48138 150.29370 151.11595 151.05339 150.8568 151.11595 150.8121 150.81207
3 153.72571 154.31558 154.27090 153.72571 154.1011 153.90446 153.3414 155.54004
4 146.61144 145.97687 144.54686 145.99474 144.7703 142.08009 140.7841 139.73846
5 112.89008 112.97946 112.21977 111.69245 111.2456 110.11945 108.5643 107.14325
6  76.58579  75.77248  76.66623  76.32661  76.0406  75.87973  74.8966  73.55597
       X150      X160      X170      X180      X190      X200      X210
1 149.66808 149.66808 149.66808 149.66808 149.66808 149.66808 149.66808
2 150.65120 150.49033 150.01663 149.41782 149.27481 149.92726 150.69589
3 152.85877 152.81409 150.54395 150.44563 150.94614 151.07126 151.66113
4 138.58551 138.05820 137.80795 137.27170 137.66495 137.38789 138.46933
5 105.19486 104.73011 103.00517 102.01310 102.29016 103.63973 104.36367
6  70.96408  69.22126  67.49631  67.46056  68.62244  69.45364  71.50033
       X220      X230      X240      X250      X260      X270      X280
1 149.66808 149.66808 149.66808 149.66808 149.66808 149.66808 149.66808
2 151.28577 152.02757 152.89452 152.76045 152.19739 152.69789 151.36620
3 151.91139 151.56282 152.65321 152.86771 153.35034 153.39502 153.52014
4 139.93509 141.46341 141.24890 140.69478 140.40877 140.06021 141.25784
5 105.35574 106.52656 106.97343 107.54543 107.70631 103.08560 104.27429
6  72.47452  73.35934  73.57384  73.25209  74.26203  74.30672  75.05747
       X290      X300      X310      X320      X330     X340      X350
1 149.66808 149.66808 149.66808 149.66808 149.66808 149.6681 149.66808
2 151.40195 151.83989 150.83888 150.40988 149.92726 150.7406 150.57970
3 153.70784 152.94815 152.50127 152.30464 151.01764 150.3295 149.39101
4 141.34721 140.80203 139.70271 140.23003 138.43358 137.4505 137.29851
5 108.76988 108.60900 108.27831 109.83344 106.73212 106.0350 104.83736
6  75.31666  75.83504  76.10317  75.03066  73.48447  71.9919  70.81214
       X360
1 149.66808
2 149.95407
3 148.55981
4 136.09195
5 104.89992
6  70.18652
Code
library(tidyverse)

library(stringr)
library(akima)
library(showtext) # font support for Korean text
library(viridis) # predefined color palettes
library(patchwork)
showtext_auto()

df |>   pivot_longer(-1, names_to = "Phi", values_to = "L", names_prefix = "X") |> mutate(Phi = as.numeric(Phi)) -> df1


PolarImagePlot <- function(Mat, outer.radius = 1, ppa = 5, cols, breaks, nbreaks = 51, axes = TRUE, circle.rads){

  # the image prep
  Mat      <- Mat[, ncol(Mat):1]
  radii    <- ((0:ncol(Mat)) / ncol(Mat)) * outer.radius

  # 5 points per arc will usually do
  Npts     <- ppa
  # all the angles for which a vertex is needed
  radians  <- 2 * pi * (0:(nrow(Mat) * Npts)) / (nrow(Mat) * Npts) + pi / 2
  # matrix where each row is the arc corresponding to a cell
  rad.mat  <- matrix(radians[-length(radians)], ncol = Npts, byrow = TRUE)[1:nrow(Mat), ]
  rad.mat  <- cbind(rad.mat, rad.mat[c(2:nrow(rad.mat), 1), 1])

  # the x and y coords assuming radius of 1
  y0 <- sin(rad.mat)
  x0 <- cos(rad.mat)

  # dimension markers
  nc <- ncol(x0)
  nr <- nrow(x0)
  nl <- length(radii)

  # make a copy for each radii, redimension in sick ways
  x1 <- aperm( x0 %o% radii, c(1, 3, 2))
  # the same, but coming back the other direction to close the polygon
  x2 <- x1[, , nc:1]
  #now stick together
  x.array <- abind::abind(x1[, 1:(nl - 1), ], x2[, 2:nl, ], matrix(NA, ncol = (nl - 1), nrow = nr), along = 3)
  # final product, xcoords, is a single vector, in order,
  # where all the x coordinates for a cell are arranged
  # clockwise. cells are separated by NAs- allows a single call to polygon()
  xcoords <- aperm(x.array, c(3, 1, 2))
  dim(xcoords) <- c(NULL)
  # repeat for y coordinates
  y1 <- aperm( y0 %o% radii,c(1, 3, 2))
  y2 <- y1[, , nc:1]
  y.array <- abind::abind(y1[, 1:(length(radii) - 1), ], y2[, 2:length(radii), ], matrix(NA, ncol = (length(radii) - 1), nrow = nr), along = 3)
  ycoords <- aperm(y.array, c(3, 1, 2))
  dim(ycoords) <- c(NULL)

  # sort out colors and breaks:
  if (!missing(breaks) & !missing(cols)){
    if (length(breaks) - length(cols) != 1){
      stop("breaks must be 1 element longer than cols")
    }
  }
  if (missing(breaks) & !missing(cols)){
    breaks <- seq(min(Mat,na.rm = TRUE), max(Mat, na.rm = TRUE), length = length(cols) + 1)
  }
  if (missing(cols) & !missing(breaks)){
    cols <- rev(heat.colors(length(breaks) - 1))
    #cols <- rev(rainbow(16)[1:(length(breaks) - 1)])
    #cols <- rev(rainbow((length(breaks) - 1)+15)[1:(length(breaks) - 1)])

  }
  if (missing(breaks) & missing(cols)){
    breaks <- seq(min(Mat,na.rm = TRUE), max(Mat, na.rm = TRUE), length = nbreaks)
    #cols <- rev(heat.colors(length(breaks) - 1))
    #cols <- rev(rainbow((length(breaks) - 1)))
    cols <- rev(rainbow((length(breaks) - 1)+15)[1:(length(breaks) - 1)])

  }

  # get a color for each cell. Ugly, but it gets them in the right order
  cell.cols <- as.character(cut(as.vector(Mat[nrow(Mat):1,ncol(Mat):1]), breaks = breaks, labels = cols))

  # start empty plot
  plot(NULL, type = "n", ylim = c(-1, 1) * outer.radius, xlim = c(-1, 1) * outer.radius, asp = 1, axes = FALSE, xlab = "", ylab = "",   xaxt='n',  yaxt='n')
  # draw polygons with no borders:
  polygon(xcoords, ycoords, col = cell.cols, border = NA)

  if (axes){

    # a couple internals for axis markup.

    RMat <- function(radians){
      matrix(c(cos(radians), sin(radians), -sin(radians), cos(radians)), ncol = 2)
    }

    circle <- function(x, y, rad = 1, nvert = 500){
      rads <- seq(0,2*pi,length.out = nvert)
      xcoords <- cos(rads) * rad + x
      ycoords <- sin(rads) * rad + y
      cbind(xcoords, ycoords)
    }
    # draw circles
    if (missing(circle.rads)){
      circle.rads <- pretty(radii)
    }
    for (i in circle.rads){
      lines(circle(0, 0, i), col = "#66666650")
    }

    # put on radial spoke axes:
    axis.rads <- c(pi/2, pi/3, pi/6, 0, 5*pi/6, 2*pi/3)
    #, 0, pi / 6, pi / 3 , pi / 2, 2 * pi / 3, 5 * pi / 6 )
    r.labs <- c(90, 60, 30, 0, 330, 300)
    l.labs <- c(270, 240, 210, 180, 150, 120)

    for (i in 1:length(axis.rads)){
      endpoints <- zapsmall(c(RMat(axis.rads[i]) %*% matrix(c(1, 0, -1, 0) * outer.radius,ncol = 2)))
      segments(endpoints[1], endpoints[2], endpoints[3], endpoints[4], col = "#66666650")
      endpoints <- c(RMat(axis.rads[i]) %*% matrix(c(1.1, 0, -1.1, 0) * outer.radius, ncol = 2))
      lab1 <- bquote(.(r.labs[i]) * degree)
      lab2 <- bquote(.(l.labs[i]) * degree)
      text(endpoints[1], endpoints[2], lab1, xpd = TRUE)
      text(endpoints[3], endpoints[4], lab2, xpd = TRUE)
    }
    axis(2, pos = -1.2 * outer.radius, at = sort(union(circle.rads,-circle.rads)))
  }
  invisible(list(breaks = breaks, col = cols))
}


Interp_ref <- akima::interp(
  x =df1$Phi, 
  y = df1$Theta, 
  z = df1$L,
  extrap = TRUE,
  xo = c(seq(270, 360, length.out = 75), seq(0, 270, length.out = 225)),
  yo = seq(0, 90, length.out = 100),
  linear = FALSE
)

Mat_ref <- Interp_ref[[3]]

PolarImagePlot(Mat_ref)

Loading Excel data in one go

Reading multiple Excel files at once

Code
library(readxl)
library(purrr)

setwd("D:/r/유형별 r 예제/Data_visualization")

list.files("data/gapminder")
 [1] "1952.xlsx" "1957.xlsx" "1962.xlsx" "1967.xlsx" "1977.xlsx" "1982.xlsx"
 [7] "1987.xlsx" "1992.xlsx" "1997.xlsx" "2002.xlsx" "2007.xlsx"
Code
list.files("data/gapminder", pattern = "[.]xlsx$",full.names = TRUE)
 [1] "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx"
 [3] "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx"
 [5] "data/gapminder/1977.xlsx" "data/gapminder/1982.xlsx"
 [7] "data/gapminder/1987.xlsx" "data/gapminder/1992.xlsx"
 [9] "data/gapminder/1997.xlsx" "data/gapminder/2002.xlsx"
[11] "data/gapminder/2007.xlsx"
Code
paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)


files <- map(paths,read_excel)
length(files)
[1] 11
Code
class(files)
[1] "list"
Code
list_rbind(files)
# A tibble: 1,562 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 1,552 more rows
Code
paths |> 
  map(read_excel) |> 
  list_rbind()
# A tibble: 1,562 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 1,552 more rows
Code
paths |> 
  set_names(basename) |> 
  map(read_excel) |> 
  list_rbind(names_to = "year") |> 
  mutate(year = parse_number(year))
# A tibble: 1,562 × 6
    year country     continent lifeExp      pop gdpPercap
   <dbl> <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1  1952 Afghanistan Asia         28.8  8425333      779.
 2  1952 Albania     Europe       55.2  1282697     1601.
 3  1952 Algeria     Africa       43.1  9279525     2449.
 4  1952 Angola      Africa       30.0  4232095     3521.
 5  1952 Argentina   Americas     62.5 17876956     5911.
 6  1952 Australia   Oceania      69.1  8691212    10040.
 7  1952 Austria     Europe       66.8  6927772     6137.
 8  1952 Bahrain     Asia         50.9   120447     9867.
 9  1952 Bangladesh  Asia         37.5 46886859      684.
10  1952 Belgium     Europe       68    8730405     8343.
# ℹ 1,552 more rows
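
How the file name ends up in the year column is easier to see on a toy case. The sketch below assumes two hypothetical paths and a stand-in reader (fake_read, not a real function from any package) so that nothing has to exist on disk.

```r
library(purrr)
library(dplyr)
library(readr)

# Hypothetical paths; only the file names matter here, nothing is read from disk
paths_demo <- c("data/demo/1952.xlsx", "data/demo/1957.xlsx")

# Stand-in for read_excel(): returns a one-row tibble for any path
fake_read <- function(path) tibble(country = "Korea")

result <- paths_demo |>
  set_names(basename) |>             # names become "1952.xlsx", "1957.xlsx"
  map(fake_read) |>                  # a named list of tibbles
  list_rbind(names_to = "year") |>   # list names go into a "year" column
  mutate(year = parse_number(year))  # "1952.xlsx" -> 1952

result
```

The key point is that list_rbind(names_to = ...) can only use names that exist, which is why set_names(basename) comes before map().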
Code
gapminder <- paths |> 
  set_names(basename) |> 
  map(read_excel) |> 
  list_rbind(names_to = "year") |> 
  mutate(year = parse_number(year))

#write_csv(gapminder, "gapminder.csv")


# Reading several CSV files at once
# alltrips <- list.files(pattern = "\\.csv$") %>% map_df(~read_csv(.))
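
The commented line above uses map_df(), which purrr 1.0 marks as superseded in favour of map() followed by list_rbind(). A minimal sketch of the same idea for CSV files, assuming two small demo files written to a temporary folder (the folder and file names are made up):

```r
library(readr)
library(purrr)

# Write two small CSV files into a temporary folder (demo data only)
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(id = 1:2, value = c(1.5, 2.5)), file.path(demo_dir, "a.csv"))
write_csv(data.frame(id = 3:4, value = c(3.5, 4.5)), file.path(demo_dir, "b.csv"))

csv_paths <- list.files(demo_dir, pattern = "[.]csv$", full.names = TRUE)

# Same pattern as the Excel example: read each file, then stack the results
all_rows <- csv_paths |>
  map(\(p) read_csv(p, show_col_types = FALSE)) |>
  list_rbind()

all_rows
```

Recent versions of readr (2.0 and later) also accept a vector of paths directly in read_csv(), so the map() step is optional for CSV files with identical columns.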

One Excel file, multiple sheets

Code
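library(readxl)
library(purrr)
library(dplyr)  # mutate()

setwd("D:/r/유형별 r 예제/Data_visualization/data/")

# One workbook with one sheet per year
file_1 <- list(
  file  = "D:/r/유형별 r 예제/Data_visualization/data/gapmider.xlsx",
  sheet = excel_sheets("gapmider.xlsx")
)

# Read every sheet (the single path is recycled across sheets) and stack the results
map2(file_1$file, file_1$sheet, ~ read_excel(path = .x, sheet = .y, range = "A1:E143")) %>%
  list_rbind(names_to = "year") %>%
  # Label each block of rows with its sheet name; assumes every sheet has the same number of rows
  mutate(year = rep(file_1$sheet, each = n() / length(file_1$sheet)))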
# A tibble: 1,562 × 6
   year  country     continent lifeExp      pop gdpPercap
   <chr> <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 1952  Afghanistan Asia         28.8  8425333      779.
 2 1952  Albania     Europe       55.2  1282697     1601.
 3 1952  Algeria     Africa       43.1  9279525     2449.
 4 1952  Angola      Africa       30.0  4232095     3521.
 5 1952  Argentina   Americas     62.5 17876956     5911.
 6 1952  Australia   Oceania      69.1  8691212    10040.
 7 1952  Austria     Europe       66.8  6927772     6137.
 8 1952  Bahrain     Asia         50.9   120447     9867.
 9 1952  Bangladesh  Asia         37.5 46886859      684.
10 1952  Belgium     Europe       68    8730405     8343.
# ℹ 1,552 more rows
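
The same result can be reached without rep() by naming the list of sheets before reading, so that list_rbind() fills in the sheet name itself. The sketch below is not the course workbook: it uses deaths.xlsx, an example file that ships with readxl, and the range "A5:F15" is specific to that file.

```r
library(readxl)
library(purrr)

# Example workbook bundled with readxl (not the course file)
path <- readxl_example("deaths.xlsx")

deaths <- excel_sheets(path) |>
  set_names() |>                                              # name each element by its sheet
  map(\(s) read_excel(path, sheet = s, range = "A5:F15")) |>  # read the same range from every sheet
  list_rbind(names_to = "sheet")                              # sheet name becomes a column

deaths
```

With this approach the sheets do not need to have the same number of rows, because the label comes from the list names rather than from a row count.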

Presenting data as tables

Using the gt package

  1. Building a simple table (areas of the world's major landmasses, from the built-in islands dataset)
Code
library(tidyverse)
library(gt)

df <- tibble(
    name = names(islands),
    size = islands
  ) 

df |> 
  arrange(-size)|>
  slice(1:10) |> 
  gt()
name           size
Asia          16988
Africa        11506
North America  9390
South America  6795
Antarctica     5500
Europe         3745
Australia      2968
Greenland       840
New Guinea      306
Borneo          280
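
The size column has no unit in the table above. In R's built-in islands dataset the values are areas in thousands of square miles, so one option is to say so in the column labels. A minimal sketch, using slice_max() as a shorter way to pick the ten largest rows; the label texts are only suggestions.

```r
library(dplyr)
library(gt)

# islands is a named numeric vector built into R:
# areas of major landmasses, in thousands of square miles
top10 <- tibble(name = names(islands), size = as.numeric(islands)) |>
  slice_max(size, n = 10)   # ten largest, sorted in descending order

tbl_labeled <- top10 |>
  gt() |>
  cols_label(
    name = "Landmass",
    size = "Area (1,000 sq mi)"
  )

tbl_labeled
```

cols_label() only changes what is displayed; the underlying column names stay name and size, so later steps can still refer to them.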

[Figure: the parts of a gt table]

  2. Adding a title and subtitle

Code
df |> 
  arrange(-size)|>
  slice(1:10) |> 
  gt() |> 
    tab_header(
    title = "Large Landmasses of the World",
    subtitle = "The top ten largest are presented"
  )
Large Landmasses of the World
The top ten largest are presented
name           size
Asia          16988
Africa        11506
North America  9390
South America  6795
Antarctica     5500
Europe         3745
Australia      2968
Greenland       840
New Guinea      306
Borneo          280
  3. Styling the title (Markdown syntax with md())
Code
df |> 
  arrange(-size)|>
  slice(1:2) |> 
  gt() |> 
  tab_header(
    title = md("**Large Landmasses of the World**"),
    subtitle = md("The *top two* largest are presented")
  )
Large Landmasses of the World
The top two largest are presented
name    size
Asia   16988
Africa 11506
  4. Adding source notes to the footer (tab_source_note)
Code
df |> 
  arrange(-size)|>
  slice(1:10) |> 
  gt() |> 
  tab_header(
    title = md("**Large Landmasses of the World**"),
    subtitle = md("The *top ten* largest are presented")
  ) |> 
  tab_source_note(
    source_note = "Source: The World Almanac and Book of Facts, 1975, page 406."
  ) |>
  tab_source_note(
    source_note = md("Reference: McNeil, D. R. (1977) *Interactive Data Analysis*. Wiley.")
  )
Large Landmasses of the World
The top ten largest are presented
name           size
Asia          16988
Africa        11506
North America  9390
South America  6795
Antarctica     5500
Europe         3745
Australia      2968
Greenland       840
New Guinea      306
Borneo          280
Source: The World Almanac and Book of Facts, 1975, page 406.
Reference: McNeil, D. R. (1977) Interactive Data Analysis. Wiley.
  5. Adding footnotes (tab_footnote)
Code
df |> 
  arrange(-size)|>
  slice(1:10) |> 
  gt() |> 
  tab_header(
    title = md("**Large Landmasses of the World**"),
    subtitle = md("The *top ten* largest are presented")
  ) |> 
  tab_source_note(
    source_note = "Source: The World Almanac and Book of Facts, 1975, page 406."
  ) |>
  tab_source_note(
    source_note = md("Reference: McNeil, D. R. (1977) *Interactive Data Analysis*. Wiley.")
  ) |> 
  # Footnote on the name cells in rows 3 and 4 (North and South America)
  tab_footnote(
    footnote = "The Americas.",
    locations = cells_body(columns = name, rows = 3:4)
  ) |> 
  # Footnote on the size cell holding the maximum value
  tab_footnote(
    footnote = "The largest by area.",
    locations = cells_body(
      columns = size,
      rows = size == max(size)
    )
  ) |>
  # Footnote on the size cell holding the minimum value
  tab_footnote(
    footnote = "The lowest by area.",
    locations = cells_body(
      columns = size,
      rows = size == min(size)
    )
  )
Large Landmasses of the World
The top ten largest are presented
name                size
Asia             ¹ 16988
Africa             11506
North America²      9390
South America²      6795
Antarctica          5500
Europe              3745
Australia           2968
Greenland            840
New Guinea           306
Borneo             ³ 280
Source: The World Almanac and Book of Facts, 1975, page 406.
Reference: McNeil, D. R. (1977) Interactive Data Analysis. Wiley.
¹ The largest by area.
² The Americas.
³ The lowest by area.
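
The locations argument decides which part of the table a footnote mark is attached to. The examples above all use cells_body(); as a minimal sketch of another location helper, the code below attaches a unit note to the size column header with cells_column_labels(). The footnote wording is an assumption based on the documented unit of the islands dataset (thousands of square miles).

```r
library(dplyr)
library(gt)

tbl_units <- tibble(name = names(islands), size = as.numeric(islands)) |>
  arrange(desc(size)) |>
  slice(1:3) |>
  gt() |>
  tab_footnote(
    footnote = "Area in thousands of square miles.",
    locations = cells_column_labels(columns = size)  # target the header, not the body
  )

tbl_units
```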
  6. Saving the table
Code
df |> 
  arrange(-size)|>
  slice(1:10) |> 
  gt() |> 
  tab_header(
    title = md("**Large Landmasses of the World**"),
    subtitle = md("The *top ten* largest are presented")
  ) |> 
    tab_source_note(
    source_note = "Source: The World Almanac and Book of Facts, 1975, page 406."
  ) |>
  tab_source_note(
    source_note = md("Reference: McNeil, D. R. (1977) *Interactive Data Analysis*. Wiley.")
  ) |> 
  
    tab_footnote(
    footnote = "The Americas.",
    locations = cells_body(columns = name, rows = 3:4)
  ) |> 
  
  tab_footnote(
    footnote = "The largest by area.",
    locations = cells_body(
      columns = size,
      rows = size == max(size)
    )
  ) |>
  tab_footnote(
    footnote = "The lowest by area.",
    locations = cells_body(
      columns = size,
      rows = size == min(size)
    )
  ) |> 
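  # Alternative: save as HTML instead of PNG
  #  gtsave(filename = "tab_1.html") |> 
  # Saving as PNG relies on the webshot2 package; expand adds whitespace (in pixels) around the table
  gtsave("tab_1.png", expand = 10)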
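
gtsave() picks the output format from the file extension. Saving as HTML needs no extra software, so it is a convenient way to check that saving works before trying PNG. A minimal sketch, assuming a small demo table and a temporary folder as the (hypothetical) output location:

```r
library(gt)

# Small demo table; any gt table works the same way
tbl_demo <- gt(head(mtcars[, 1:3]))

# The .html extension selects HTML output; path sets the target folder
gtsave(tbl_demo, filename = "tab_demo.html", path = tempdir())

# Confirm the file was written
file.exists(file.path(tempdir(), "tab_demo.html"))
```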